# Combining fields
df['TITLE_CLEAN'] = df['TITLE_CLEAN'].fillna('unknown').astype(str).str.strip().str.lower()
df['SOFTWARE_SKILLS_NAME'] = df['SOFTWARE_SKILLS_NAME'].fillna('').astype(str).str.lower()
df['SPECIALIZED_SKILLS_NAME'] = df['SPECIALIZED_SKILLS_NAME'].fillna('').astype(str).str.lower()
# Combine text fields for TF-IDF
df['combined_text'] = df['TITLE_CLEAN'] + ' ' + df['SOFTWARE_SKILLS_NAME'] + ' ' + df['SPECIALIZED_SKILLS_NAME']
Unsupervised Learning Model
Data Processing
Text Preprocessing: Combine Job Title and Skills into a Single Field for TF-IDF
- We combined the job title and skills into a single text field (combined_text) to create a richer, unified input for the TF-IDF vectorizer. This improves the quality of feature extraction by capturing more context about each job, enabling better clustering and analysis.
Unique Value Counts in Job Titles and Skill Fields
Unique values in 'TITLE_CLEAN': 27266
Unique values in 'SOFTWARE_SKILLS_NAME': 22456
Unique values in 'SPECIALIZED_SKILLS_NAME': 41462
- Counting unique values helps assess the diversity and granularity of job titles and skill mentions before clustering or vectorization, which is important for understanding feature richness and potential noise in the dataset.
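Counts like the ones above can be obtained with pandas `Series.nunique()`. The following is a minimal sketch on an invented toy DataFrame. Only the column names come from this report, and the source does not say whether the counts were taken before or after the `fillna` step.

```python
import pandas as pd

# Toy stand-in for the job-postings DataFrame; the rows are invented for illustration.
toy = pd.DataFrame({
    "TITLE_CLEAN": ["data analyst", "data analyst", "sap consultant", None],
    "SOFTWARE_SKILLS_NAME": ["sql, python", "tableau", "sap", "sap"],
    "SPECIALIZED_SKILLS_NAME": ["statistics", "dashboards", "erp", "erp"],
})

text_columns = ["TITLE_CLEAN", "SOFTWARE_SKILLS_NAME", "SPECIALIZED_SKILLS_NAME"]

# nunique() counts distinct non-missing values in each column.
unique_counts = {col: toy[col].nunique() for col in text_columns}

for col, n in unique_counts.items():
    print(f"Unique values in '{col}': {n}")
```

Because `nunique()` ignores missing values by default, filling gaps with a placeholder such as 'unknown' beforehand would add one to the count for that column.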
NLP + K Means Clustering
Text Vectorization and Feature Scaling for Clustering
#| eval: true
#| echo: false
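from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Vectorize: keep the 1,000 most frequent terms after removing English stop words
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
X_tfidf = tfidf.fit_transform(df['combined_text']).toarray()
# Scale each TF-IDF feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_tfidf)
We performed text vectorization and feature scaling, which are essential preprocessing steps before clustering.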
TfidfVectorizer converts the cleaned job and skills text (combined_text) into a numeric matrix based on word importance (TF-IDF), enabling text-based clustering.
StandardScaler scales the TF-IDF features to have zero mean and unit variance, which is important because KMeans is sensitive to feature magnitudes.
KMeans Clustering and Evaluation with NAICS 6-Digit Labels
Adjusted Rand Index (NAICS_2022_6_NAME): 0.009
Normalized Mutual Info Score (NAICS_2022_6_NAME): 0.033
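The clustering cell itself is not shown in this report, so the following is only a sketch of how scores like these could be produced. The toy matrix and reference labels are invented stand-ins for `X_scaled` and `NAICS_2022_6_NAME`. `k=3` comes from the pipeline summary, while `random_state` and `n_init` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy feature matrix standing in for X_scaled: three well-separated blobs (invented data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Invented reference labels standing in for the NAICS column.
reference = np.repeat(["industry_a", "industry_b", "industry_c"], 20)

# k=3 follows the report; random_state and n_init are assumptions for reproducibility.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_toy)

# Both metrics compare groupings, so the cluster ids need not match the label names.
ari = adjusted_rand_score(reference, cluster_labels)
nmi = normalized_mutual_info_score(reference, cluster_labels)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Info Score: {nmi:.3f}")
```

On this deliberately easy toy data the scores come out near their maximum, which is the opposite end of the scale from the values reported for the real job postings.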
Evaluate Clustering Using Multiple Reference Labels (NAICS, SOC, ONET)
#| eval: true
#| echo: false
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
reference_labels = ['NAICS_2022_6_NAME', 'SOC_2021_5_NAME', 'ONET_NAME']
results = []
for label in reference_labels:
    df_eval = df[[label, 'cluster']].dropna()
    ari = adjusted_rand_score(df_eval[label], df_eval['cluster'])
    nmi = normalized_mutual_info_score(df_eval[label], df_eval['cluster'])
    results.append({'Reference Label': label, 'ARI': ari, 'NMI': nmi})
| | Reference Label | ARI | NMI |
|---|---|---|---|
| 0 | NAICS_2022_6_NAME | 0.0092 | 0.0331 |
| 1 | SOC_2021_5_NAME | 0.0000 | 0.0000 |
| 2 | ONET_NAME | 0.0000 | 0.0000 |
NAICS_2022_6_NAME has the highest agreement with the clusters, though at ARI 0.0092 and NMI 0.0331 it is still close to chance level, suggesting at most a slight alignment with industry-based classification.
SOC and ONET labels score 0.0000 on both metrics, meaning the clusters derived from TF-IDF features of job titles and skills show no measurable correspondence to occupation-based taxonomies.
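To help read these scores, the sketch below shows the two extremes of ARI and NMI on a tiny invented example. It also illustrates one situation that yields zero on both metrics: a labeling with a single distinct value. Whether something similar happened to the SOC and ONET columns after `dropna()` is not stated in the source, so treat this only as something to check.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

reference = ["a", "a", "b", "b"]

# Same grouping with different cluster ids: both metrics ignore the label names.
relabelled = [1, 1, 0, 0]
ari_same = adjusted_rand_score(reference, relabelled)
nmi_same = normalized_mutual_info_score(reference, relabelled)

# Every row in one cluster: the labeling carries no information about the reference.
single_cluster = [0, 0, 0, 0]
ari_degenerate = adjusted_rand_score(reference, single_cluster)
nmi_degenerate = normalized_mutual_info_score(reference, single_cluster)

print("identical grouping:", ari_same, nmi_same)
print("single cluster:", ari_degenerate, nmi_degenerate)
```

A quick `df_eval[label].nunique()` inside the evaluation loop would show whether any reference column collapsed to one value.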
Visualize TF-IDF-Based Clusters with PCA and Plotly
Interpretation:
The three clusters (color-coded) are distinct in the PCA space, suggesting that the clustering algorithm was able to differentiate based on text patterns.
This model is capturing textual similarity (e.g., shared tools, terms, or phrasing in job descriptions), not necessarily formal job classifications.
Cluster boundaries are data-driven, not taxonomy-aligned.
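The plotting cell is hidden in this report. A minimal sketch of a two-component PCA projection prepared for Plotly might look like the following. The toy matrix and cluster ids are invented, and `plot_clusters` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy matrix standing in for X_scaled and toy cluster ids (both invented).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 5))
clusters = np.repeat([0, 1, 2], 10)

# Project onto the two directions of highest variance.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_toy)

plot_df = pd.DataFrame({
    "PC1": coords[:, 0],
    "PC2": coords[:, 1],
    "cluster": clusters.astype(str),  # strings so Plotly treats clusters as categories
})

# Share of total variance kept by each of the two components.
print(pca.explained_variance_ratio_)


def plot_clusters(frame):
    """Return a Plotly scatter of the projection, colored by cluster."""
    import plotly.express as px  # imported lazily; only needed when plotting
    return px.scatter(frame, x="PC1", y="PC2", color="cluster")
```

Calling `plot_clusters(plot_df).show()` would render the figure. Checking `explained_variance_ratio_` is worthwhile because apparent separation in two dimensions says little if those components keep only a small share of the variance.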
Top Terms Representing Each Cluster (TF-IDF Feature Importance)
Cluster 0:
pmi, apple, institute, ios, android, vmware, desktop, methodology, expectation, zachman, windows, infrastructure, capability, operating, subcontracting
Cluster 1:
data, language, programming, sql, intelligence, python, tableau, analysis, dashboard, bi, power, statistics, visualization, analyst, analytics
Cluster 2:
sap, enterprise, consultant, applications, oracle, functional, management, planning, cloud, architect, architecture, solution, design, erp, resource
Nomenclature of our Clusters
Cluster 0 = “IT Infrastructure & Support”
Cluster 1 = “Data Analytics & BI”
Cluster 2 = “Enterprise Applications & Consulting”
Visualizing Representative Job Titles Across Clusters
Cluster 0 (“IT Infrastructure & Support”)
→ Jobs like enterprise support analyst, senior IT analyst, data integration analyst, IT enterprise architect.
→ These titles are support, IT system maintenance, integration, and architecture focused.
Cluster 1 (“Data Analytics & BI”)
→ Jobs like sr BI analyst, data analyst, data scientist, data research analyst.
→ Heavy on analytics, business intelligence (BI), and data science skills, which matches the cluster label closely.
Cluster 2 (“Enterprise Applications & Consulting”)
→ Jobs like SAP BTP consultant, ERP integrations analyst, applications consultant, product architect.
→ Clearly related to enterprise software (SAP, ERP) and consulting roles.
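The report does not say how the representative titles were picked. Taking the most frequent titles per cluster is one plausible approach, sketched here on invented rows with a hypothetical helper `top_titles`.

```python
import pandas as pd

# Invented rows; 'cluster' stands in for the KMeans assignment column.
toy = pd.DataFrame({
    "TITLE_CLEAN": ["data analyst", "data analyst", "bi analyst",
                    "sap consultant", "sap consultant", "erp analyst"],
    "cluster": [1, 1, 1, 2, 2, 2],
})


def top_titles(frame, n=2):
    """Return the n most frequent titles in each cluster as {cluster: [titles]}."""
    return {
        cluster_id: group["TITLE_CLEAN"].value_counts().head(n).index.tolist()
        for cluster_id, group in frame.groupby("cluster")
    }


titles_by_cluster = top_titles(toy)
print(titles_by_cluster)
```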
Preprocessing Software Skills for Analysis
Clustered Software Skill Visualization
Interpretation
Cluster 0 (“IT Infrastructure & Support”)
Common skills: Microsoft Excel, Microsoft SharePoint, Docusign, SAP Applications, TOGAF, automated cost tools.
→ These tools are typical for IT operations, documentation, system architecture support.
Cluster 1 (“Data Analytics & BI”)
Common skills: Python, SQL (PL/SQL), Looker, Tableau, Power BI, Google Analytics, Qlik Sense.
→ Clear focus on analytics, data visualization, and programming languages.
Cluster 2 (“Enterprise Applications & Consulting”)
Common skills: SAP Sales and Distribution, Google Cloud Platform (GCP), Microsoft OneNote, IBM Maximo.
→ These are enterprise-level software systems for consulting, ERP, and large infrastructure projects.
Conclusion:
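The software skills distribution is broadly consistent with the previously assigned cluster themes based on job titles and top terms, although a few tools (for example SAP Applications in Cluster 0 and Microsoft OneNote in Cluster 2) cut across themes.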
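The preprocessing and counting cells for software skills are hidden. The sketch below shows one way to get per-cluster skill frequencies, assuming each `SOFTWARE_SKILLS_NAME` value is a comma-separated string. The actual field format is not shown in the source and may differ, and the rows are invented.

```python
import pandas as pd

# Invented rows; the comma-separated format of the skills field is an assumption.
toy = pd.DataFrame({
    "SOFTWARE_SKILLS_NAME": ["python, sql", "sql, tableau", "sap, oracle", "sap"],
    "cluster": [1, 1, 2, 2],
})

# Split each string into a list, then give every skill its own row.
skills = (
    toy.assign(skill=toy["SOFTWARE_SKILLS_NAME"].str.split(","))
       .explode("skill")
)
skills["skill"] = skills["skill"].str.strip()
skills = skills[skills["skill"] != ""]  # drop empty entries left by blank fields

# Count how often each skill appears within each cluster, most frequent first.
skill_counts = (
    skills.groupby(["cluster", "skill"]).size()
          .rename("count")
          .reset_index()
          .sort_values(["cluster", "count"], ascending=[True, False])
)
print(skill_counts)
```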
Average salary per cluster
Average salary in Cluster 0: $132,148.75
Average salary in Cluster 1: $105,813.00
Average salary in Cluster 2: $127,433.09
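Per-cluster averages like these are typically a one-line `groupby`. In the sketch below the rows are invented and `SALARY` is a hypothetical column name, since the report does not name its salary field.

```python
import pandas as pd

# Invented rows; SALARY is a hypothetical column name, not taken from the source.
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "SALARY": [120000.0, 140000.0, 90000.0, 110000.0, 100000.0, None],
})

# mean() skips missing salaries, so each average covers only postings with a value.
avg_salary = toy.groupby("cluster")["SALARY"].mean()

for cluster_id, value in avg_salary.items():
    print(f"Average salary in Cluster {cluster_id}: ${value:,.2f}")
```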
Salary Distribution Across Clusters
Interpretation for the Boxplot
Boxplot shows the salary distribution across three clusters derived from unsupervised KMeans clustering on job titles and skills:
Cluster 0:
- Has a relatively high median salary (~$125K) and moderate spread, suggesting roles with consistent mid-to-high pay (e.g., enterprise or management roles).
Cluster 1:
- Has the lowest median salary (~$95K) with many outliers, indicating entry-to-mid-level roles with high variance (e.g., data or analyst roles).
Cluster 2:
- Shows the widest salary range with the highest outliers (up to $500K), implying this cluster contains senior or highly specialized roles (e.g., consultants or architects).
Overall, the plot reflects meaningful salary differences between the clusters, supporting the relevance of clustering for job role segmentation.
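The quantities a boxplot encodes (median, interquartile range, and points beyond the whiskers) can also be tabulated directly. This sketch uses invented salaries, a hypothetical `SALARY` column, and the conventional 1.5 × IQR whisker rule, which is an assumption about how the plot was drawn.

```python
import pandas as pd

# Invented rows; SALARY is a hypothetical column name.
toy = pd.DataFrame({
    "cluster": [0] * 5 + [1] * 5,
    "SALARY": [100, 110, 120, 130, 140, 80, 90, 95, 100, 500],
})


def box_stats(salaries):
    """Median, IQR, and count of points outside the 1.5 * IQR whiskers."""
    q1, median, q3 = salaries.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    outliers = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]
    return pd.Series({"median": median, "iqr": iqr, "n_outliers": len(outliers)})


# One row of statistics per cluster.
summary = pd.DataFrame(
    {cluster_id: box_stats(group) for cluster_id, group in toy.groupby("cluster")["SALARY"]}
).T
print(summary)
```

In the toy data a single extreme value pulls the mean of cluster 1 far above its median, which is why clusters can rank differently by average than by median salary.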
Conclusion and Key Takeaways
We applied KMeans clustering to job postings using text data from job titles and associated skills (software + specialized). Despite very low alignment with external classification labels such as NAICS, SOC, and ONET (as shown by ARI and NMI scores), our analysis still uncovered distinct, interpretable clusters with practical insights.
Clustering Pipeline Summary
Text Preprocessing: Combined job title, software, and specialized skills into a combined_text field.
Vectorization: Used TfidfVectorizer to convert text to numerical features.
Scaling: Applied StandardScaler to normalize TF-IDF vectors.
Clustering: Ran KMeans with k=3 clusters.
Evaluation:
ARI (Adjusted Rand Index): Max ~0.009 with NAICS_2022_6_NAME
NMI (Normalized Mutual Info): Max ~0.033
Interpretation: Clusters do not align well with predefined industry/occupation codes, which is not unusual in unsupervised learning because the algorithm never sees those labels.
Key Visual Insights
1. PCA Projection
The PCA plot revealed clear separation between clusters, indicating the clustering algorithm did find structural patterns in job descriptions.
2. Top Terms per Cluster
Cluster 0: Keywords like apple, ios, vmware, infrastructure suggest tech roles focused on devices, systems, and IT frameworks.
Cluster 1: Terms like sql, tableau, python, analysis indicate data-related roles (analysts, BI, data scientists).
Cluster 2: Words like sap, oracle, consultant, planning suggest enterprise solutions, consultants, or ERP specialists.
3. Sample Job Titles by Cluster
Confirms term-based interpretations:
Cluster 0: IT infrastructure & support
Cluster 1: Data analysts and BI roles
Cluster 2: SAP/ERP consultants and architects
4. Software Skills by Cluster
Cluster 0: Excel, SharePoint, PowerPoint – general office + support tools
Cluster 1: Python, Tableau, Power BI – analytics & data tools
Cluster 2: SAP, Oracle, GCP – enterprise software and cloud tools
5. Salary Distribution by Cluster
Cluster 0: Highest average (~$132K) and a relatively high median (~$125K) with moderate spread – stable IT roles
Cluster 1: Lowest median (~$95K) with many outliers – entry-to-mid-level data roles
Cluster 2: Widest range and the most extreme outliers (up to $500K) – senior consultants & architects
Final Takeaways
Even with low overlap to government taxonomies (NAICS/SOC/ONET), clustering successfully revealed latent role patterns based on real-world skills and job titles.
Unsupervised clustering can meaningfully group job postings by functional role, skill stack, and salary range, offering powerful segmentation for:
Career recommendation systems
Skill gap analysis
Compensation benchmarking
Targeted recruitment strategies